2-class Internal Cross-validation Pruned Eigen Transformation Classification Trees
Authors
Abstract
In [3] it was demonstrated that decision trees built in a feature space produced by an eigen transformation can be competitive with industry standards. Unfortunately, the choice of transformation and the dimension of the feature space to retain are not self-evident. These trees do, however, have interesting properties that can be exploited. Because the order of the splits is fixed by the known importance of each feature, given by its corresponding eigenvalue, every tree is a pruned version of the largest tree. This property makes it possible to prune such a tree with an internal cross-validation on the training data, a technique that should overfit less than, for example, the estimated error rates used to prune C4.5 classification trees [4], while still using the entire training data to build the tree. We therefore present an algorithm that divides the training data into folds, as in cross-validation. Split values are calculated on each internal training fold, the nodes of the tree are evaluated on the corresponding internal test folds, and nodes that overfit are pruned. This is done for each of the eigen transformations; the best tree is selected, and the final split values of the selected pruned tree are then calculated from the entire training data. Results show that the resulting trees can be expected to be optimal or near-optimal when there is enough training data relative to the size of the tree.
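What follows is a minimal sketch of the scheme described above, assuming PCA as the eigen transformation and a deliberately simplified tree in which eigen-feature k is always split at depth k, so that every depth-d tree is a pruned version of the full tree and node-level pruning reduces to picking the depth with the lowest internal cross-validation error. Helper names such as fit_split_values and prune_by_internal_cv are illustrative, not from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold

def fit_split_values(Z, y, depth):
    """One threshold per eigen-feature: midpoint between the two class means."""
    return np.array([(Z[y == 0, k].mean() + Z[y == 1, k].mean()) / 2.0
                     for k in range(depth)])

def leaf_index(Z, thresholds):
    """Fixed split order: feature k is tested at depth k, so the sequence of
    left/right decisions can be read as the binary digits of a leaf id."""
    bits = (Z[:, :len(thresholds)] > thresholds).astype(int)
    return bits @ (1 << np.arange(len(thresholds)))

def fit_leaves(Z, y, thresholds):
    """Label each leaf with the majority class of the training points in it;
    empty leaves fall back to the global majority class."""
    ids = leaf_index(Z, thresholds)
    labels = np.full(1 << len(thresholds), np.bincount(y).argmax())
    for leaf in np.unique(ids):
        labels[leaf] = np.bincount(y[ids == leaf]).argmax()
    return labels

def error_rate(Z, y, thresholds, leaf_labels):
    return np.mean(leaf_labels[leaf_index(Z, thresholds)] != y)

def prune_by_internal_cv(Z, y, max_depth, n_folds=5):
    """Fit split values on each internal training fold, score every pruned
    depth on the matching internal test fold, and keep the best depth.
    Ties are broken toward the smaller (more heavily pruned) tree."""
    cv_err = np.zeros(max_depth + 1)
    for tr, te in StratifiedKFold(n_folds, shuffle=True, random_state=0).split(Z, y):
        for d in range(max_depth + 1):
            th = fit_split_values(Z[tr], y[tr], d)
            cv_err[d] += error_rate(Z[te], y[te], th, fit_leaves(Z[tr], y[tr], th))
    return int(np.argmin(cv_err))

# Example on synthetic 2-class data; PCA orders the features by eigenvalue.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(1.0, 1.0, (100, 5))])
y = np.repeat([0, 1], 100)
Z = PCA().fit_transform(X)
depth = prune_by_internal_cv(Z, y, max_depth=5)
th = fit_split_values(Z, y, depth)   # final split values on the entire training set
print("pruned depth:", depth,
      "training error:", error_rate(Z, y, th, fit_leaves(Z, y, th)))
```

Because the split order is fixed by the eigenvalues, the internal folds only have to re-estimate thresholds, not tree structure, which is what allows the final pruned tree to be refit on the entire training set.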
Similar Papers
A Bayes' Theorem Based Approach for the Selection of Best Pruned Tree
Decision tree pruning is critical for the construction of good decision trees. The most popular and widely used of the various pruning methods is cost-complexity pruning, whose implementation requires a training dataset to develop a full tree and a validation dataset to prune it. However, different pruned trees are found to be produced when the original dataset is randomly partitio...
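For concreteness, here is a small sketch of the validation-set variant of cost-complexity pruning discussed above, using scikit-learn's cost_complexity_pruning_path; it illustrates the standard method and its sensitivity to the random partition, not the Bayes'-theorem-based selection proposed in the paper. The dataset and parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# A different random_state yields a different partition and can select a
# different pruned tree, which is the instability noted above.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate complexity parameters come from the full tree grown on the training split.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# The validation split selects among the corresponding pruned trees.
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a)
          .fit(X_tr, y_tr).score(X_val, y_val) for a in path.ccp_alphas]
print("selected ccp_alpha:", path.ccp_alphas[int(np.argmax(scores))])
```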
Evaluation of Decision Tree Pruning with Subadditive Penalties
Recent work on decision tree pruning [1] has brought to the attention of the machine learning community the fact that, in classification problems, the use of subadditive penalties in cost-complexity pruning has a stronger theoretical basis than the usual additive penalty terms. We implement cost-complexity pruning algorithms with general size-dependent penalties to confirm the results of [1]. N...
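For reference, the additive criterion used by standard cost-complexity pruning is shown below alongside a subadditive variant of the kind evaluated above; the square-root penalty is only an illustrative subadditive function, not necessarily the one studied in [1].

```latex
% Additive cost-complexity pruning: error plus a penalty linear in the
% number of leaves |\tilde{T}| of the candidate subtree T.
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|
% A subadditive variant replaces the linear term with a subadditive
% function \Phi (for example \Phi(n) = \sqrt{n}):
R_\alpha(T) = R(T) + \alpha \, \Phi(|\tilde{T}|),
\qquad \Phi(m + n) \le \Phi(m) + \Phi(n).
```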
Identifying diagnostic errors with induced decision trees.
OBJECTIVE: The purpose of this article is to compare the diagnostic accuracy of induced decision trees with that of pruned neural networks and to improve the accuracy and interpretation of breast cancer diagnosis from readings of thin-needle aspirate by identifying cases likely to be misclassified by induced decision rules. METHOD: Using an online database consisting of 699 cases of suspected b...
New Approaches to Classification in Remote Sensing Using Homogeneous and Hybrid Decision Trees to Map Land Cover
Decision tree classification procedures have been largely overlooked in remote sensing applications. In this paper we compare the classification performance of three types of decision trees across three different data sets. The classifiers considered include a univariate decision tree, a multivariate decision tree, and a hybrid decision tree. Results from an n-fold cross-validation proce...
On the VC-Dimension of Univariate Decision Trees
In this paper, we give and prove lower bounds on the VC-dimension of the univariate decision tree hypothesis class. The VC-dimension of a univariate decision tree depends on the VC-dimensions of its subtrees and the number of inputs. In our previous work (Aslan et al., 2009), we proposed a search algorithm that calculates the VC-dimension of univariate decision trees exhaustively. Using...